Meaning of Balanced datasets - machine-learning

I am researching audio classification, more specifically balanced vs. imbalanced audio datasets.
Suppose I have two folders, one per class: car sounds and motorcycle sounds. The car folder has 1000 .wav files and the motorcycle folder has 1000 .wav files too. Does that mean I have a balanced dataset just because the counts are equal? What if the total size of the .wav files in the car class is 500 MB and the other is 200 MB? And even if both folders have the same total size, what if the individual car recordings are longer than the motorcycle recordings?

A balanced dataset means the same number of examples from each class. Shorter examples are often padded to the same length so they fit into classifiers. I don't have a background in audio, so I can't say whether padding is the norm, but as long as your network has some way of reconciling different input lengths that does not involve creating more inputs, a 1000-1000 split is balanced.
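
For illustration, here is a minimal sketch of the padding idea mentioned above, assuming librosa and numpy are available; the sample rate, target duration, and file paths are my own example choices, not something prescribed by the question.

```python
# Minimal sketch: load clips and pad/truncate to a fixed duration so every
# example presents the same input size to the classifier.
# Assumes librosa and numpy are installed; paths and durations are examples only.
import numpy as np
import librosa

TARGET_SR = 16000          # resample everything to one rate (assumption)
TARGET_SECONDS = 4.0       # chosen fixed clip length (assumption)
TARGET_LEN = int(TARGET_SR * TARGET_SECONDS)

def load_fixed_length(path):
    y, _ = librosa.load(path, sr=TARGET_SR, mono=True)
    if len(y) < TARGET_LEN:                   # shorter clip: zero-pad at the end
        y = np.pad(y, (0, TARGET_LEN - len(y)))
    else:                                     # longer clip: truncate
        y = y[:TARGET_LEN]
    return y

# x_car = load_fixed_length("car/0001.wav")          # hypothetical file
# x_moto = load_fixed_length("motorcycle/0001.wav")  # hypothetical file
```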

Related

Why does object detection result in multiple found objects?

I trained an object detector with CreateML, and when I test the model in CreateML I get a high number of identified objects.
Notes:
The model was trained on a small dataset of ~30 images, with that particular label (face-gendermale) occurring ~20 times.
Each training image has 1-3 labelled objects.
There are 5 labels in total.
Questions:
Is that expected or is there something wrong with the model?
If this is expected, how should I evaluate these multiple results, or even count the number of objects found by the model?
Cross-posted in Apple Developer Forums.
A typical object detection model will make about 1000 predictions for every image (it can be many more, depending on the model architecture). Most of these predictions have very low confidence, so they are filtered out. The ones that are left are then sent through non-maximum suppression (NMS), which removes bounding boxes that overlap too much.
In your case, the thresholds seem badly tuned (the confidence threshold too low, or the NMS overlap threshold too high), because many overlapping boxes survive.
However, it also seems that the model hasn't been trained very well yet, probably because you used very few images.
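
To make that post-processing concrete, here is a rough, generic sketch of confidence filtering followed by NMS. This is not CreateML's internal implementation, and the thresholds are illustrative assumptions.

```python
# Generic sketch: 1) drop low-confidence boxes, 2) apply non-maximum suppression
# so heavily overlapping boxes are removed. boxes: ndarray (N, 4), scores: ndarray (N,).
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def filter_and_nms(boxes, scores, conf_thresh=0.5, iou_thresh=0.45):
    keep_mask = scores >= conf_thresh                 # 1) confidence filtering
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = np.argsort(-scores)                       # 2) NMS: greedily keep best boxes
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in kept):
            kept.append(i)
    return boxes[kept], scores[kept]
```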

Preparing MFCC audio feature- Should all WAV files be at same length?

I would like to prepare an audio dataset for a machine learning model.
Each .wav file should be represented as an MFCC image.
While all of the images will have the same number of MFCC coefficients (20), the lengths of the .wav files are between 3 and 5 seconds.
Should I manipulate all the .wav files to have the same length?
Should I normalize the MFCC values (between 0 and 1) prior to plotting?
Are there any important steps to do with such data before passing it to a machine learning model?
Further reading links would also be appreciated.
Most classifiers require a fixed-size input, yes. You can achieve this by cropping or padding the MFCCs after you have calculated them; there is no need to manipulate the WAV/waveform itself.
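
As a concrete (hypothetical) sketch of that pad-or-crop approach, assuming librosa is available; the target frame count and the min-max normalisation are example choices, not requirements.

```python
# Sketch: pad or crop the MFCC matrix itself (not the waveform) to a fixed width.
import numpy as np
import librosa

N_MFCC = 20
TARGET_FRAMES = 160   # roughly 5 s at sr=16000 with the default hop of 512 (assumption)

def mfcc_fixed(path):
    y, sr = librosa.load(path, sr=16000, mono=True)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)   # shape: (20, n_frames)
    if m.shape[1] < TARGET_FRAMES:
        m = np.pad(m, ((0, 0), (0, TARGET_FRAMES - m.shape[1])))  # pad the time axis
    else:
        m = m[:, :TARGET_FRAMES]                                  # crop the time axis
    # optional min-max normalisation to [0, 1], as asked in the question
    return (m - m.min()) / (m.max() - m.min() + 1e-9)
```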
Another approach is to split your audio files into multiple analysis windows, say 1 second each. A 3-second file then needs 3 predictions (or more if the windows overlap), while a 5-second file needs 5 predictions (or more). To get a clip-wide prediction, merge the predictions over all windows in the clip. The easy way to train like this requires assuming that the label given to the clip is valid for every individual analysis window.
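
A minimal sketch of that windowing idea, where model.predict_proba stands in for whatever classifier you use (an assumption, not an API from the question):

```python
# Split a clip into fixed 1 s windows, predict per window, then average the
# per-window probabilities to get a clip-level prediction.
import numpy as np

def clip_prediction(y, sr, model, window_seconds=1.0):
    win = int(sr * window_seconds)
    windows = [y[i:i + win] for i in range(0, len(y) - win + 1, win)]
    probs = np.stack([model.predict_proba(w[np.newaxis, :])[0] for w in windows])
    return probs.mean(axis=0)   # merge: mean over all windows in the clip
```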

Identifying specific parts of a document using CRF

My goal is, given a set of documents (mostly in the financial domain), to identify specific parts of them, such as the company name or the type of the document.
The training is assumed to be done on a couple of hundred documents. Obviously I will have a skewed class distribution, with None dominating around 99.9% of the examples.
I plan to use a CRF (CRFsuite via its sklearn wrapper) and have gone through the necessary literature. I need some advice on the following fronts:
Will the dataset be sufficient to train the CRF? Considering each document can be split into around 100 tokens (each token being a training instance), we would get 10,000 instances in total.
Will the dataset be too skewed for training a CRF? For example, for 100 documents I would have around 400 instances of a given class and around 8000 instances of None.
Nobody knows; you have to try it on your dataset, check the resulting quality, maybe inspect the CRF model (e.g. https://github.com/TeamHG-Memex/eli5 has sklearn-crfsuite support - a shameless plug), try to come up with better features, or decide to annotate more examples, etc. This is just general data science work. The dataset size looks on the low side, but depending on how structured the data is and how good the features are, a few hundred documents may be enough to get started. As the dataset is small, you may have to invest more time in feature engineering.
I don't think class imbalance would be a problem; at least, it is unlikely to be your main problem.
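
For reference, a minimal sketch of training a CRF with sklearn-crfsuite and inspecting it with eli5, as suggested above. The feature functions, documents, and labels below are toy assumptions, not a recommended feature set for financial documents.

```python
# Toy sketch: sklearn-crfsuite training plus eli5 weight inspection.
import sklearn_crfsuite
import eli5

def token_features(tokens, i):
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_upper": tok.isupper(),
        "is_title": tok.istitle(),
        "is_digit": tok.isdigit(),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# X: list of documents, each a list of per-token feature dicts
# y: list of documents, each a list of labels ("COMPANY", "None", ...)
docs = [["Acme", "Corp", "annual", "report"]]          # toy document
X = [[token_features(d, i) for i in range(len(d))] for d in docs]
y = [["COMPANY", "COMPANY", "None", "None"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, y)

# eli5.explain_weights renders the learned transition/state weights, which is
# useful for sanity-checking features when the dataset is small.
explanation = eli5.explain_weights(crf, top=20)
```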

Optimizing Neural Network Input for Convergence

I'm building a neural network for image classification/recognition. There are 1000 images (30x30 greyscale) for each of the 10 classes. Images of different classes are placed in different folders. I'm planning to use the back-propagation algorithm to train the net.
Does the order in which I feed training examples into the net affect its convergence?
Should I feed training examples in random order?
First I will answer your questions:
Yes, it will affect its convergence.
Yes, it's encouraged to do that; it's called randomized arrangement (shuffling).
But why?
referenced from here
A common example in most ANN software is the IRIS data, where you have 150 instances comprising your dataset. These are about three different types of Iris flowers (Versicolor, Virginica, and Setosa). The dataset contains measurements of four variables (sepal length and width, and petal length and width). The cases are arranged so that the first 50 cases belong to Setosa, cases 51-100 belong to Versicolor, and the rest belong to Virginica. Now, what you do not want to do is present them to the network in that order. In other words, you do not want the network to see all 50 instances of the Versicolor class, then all 50 of the Virginica class, then all 50 of the Setosa class. Without randomization your training set won't represent all the classes at any one time and, hence, there is no convergence, and the network will fail to generalize.
Another example: in the past I had 100 images for each letter of the alphabet (26 classes).
When I trained on them in order (letter by letter), training failed to converge, but after I randomized the order it converged easily, because the neural network could generalize across the letters.
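
A minimal sketch of per-epoch shuffling; train_step stands in for whatever back-propagation update you use (an assumption, not a specific framework call):

```python
# Shuffle the training set every epoch so the network never sees all examples
# of one class in a contiguous block.
import numpy as np

def train(X, y, epochs, train_step, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)          # new random order every epoch
        for i in order:
            train_step(X[i], y[i])          # present one shuffled example at a time
```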

What are the different strategies for detecting noisy data in a pile of text?

I have around 10 GB of text from which I extract features based on a bag-of-words model. The problem is that the feature space is very high-dimensional (1 million words), and I cannot discard words based only on their counts, because both the most and least frequent words are important for the model to perform well. What are the different strategies for reducing the size of the training data and the number of features while still maintaining or improving the model's performance?
Edit:
I want to reduce the size of the training data both because of overfitting and because of training time. I am using FastRank (boosted trees) as my ML model. My machine has a Core i5 processor with 8 GB of RAM. The number of training instances is on the order of 700-800 million. Along with processing, it takes more than an hour for the model to train. I currently do random sampling of the training and test data to reduce the size to around 700 MB, so that training finishes in minutes.
I'm not totally sure whether this will help, because I don't know what your study is about, but if there is a logical way to divide up the 10 GB of text (into documents or paragraphs, perhaps), you can try tf-idf: http://en.wikipedia.org/wiki/Tf%E2%80%93idf
This lets you discard words that appear very often across all partitions; the usual understanding is that they don't contribute significant value to the overall document/paragraph.
And if your only requirement is to keep the most and least frequent words, would looking at the distribution of word frequencies help? Get rid of the words around the average, within 1 standard deviation (or whatever cutoff you see fit).
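
If it helps, here is a small sketch of the tf-idf route using scikit-learn's TfidfVectorizer; the cutoff parameters are illustrative assumptions, not recommendations for your data.

```python
# Sketch: down-weight/drop words that occur in most partitions and cap the vocabulary.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["first paragraph of text ...", "second paragraph ..."]  # your partitions

vectorizer = TfidfVectorizer(
    max_df=0.9,            # drop words that appear in more than 90% of partitions
    min_df=2,              # drop words that appear in fewer than 2 partitions
    max_features=100_000,  # hard cap on the feature space
)
X = vectorizer.fit_transform(documents)   # sparse (n_docs, n_features) matrix
```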
